1. INTRODUCTION

Reliable forecasts make it possible to track loads for proper balancing and to create dynamic
energy pricing models and trading opportunities for energy users, based on knowledge of their anticipated
power needs. Load forecasting is very useful in the scheduling of devices [1] and in energy trading, which is
becoming the centerpiece of a developing energy revolution. To analyze power consumption trends,
characterize patterns, and develop forecasts, various statistical and traditional methods [2], [3] are used.
However, modeling a complex real-world problem, such as power forecasting, with linear statistical models
like the autoregressive model (AR), autoregressive moving average (ARMA), autoregressive integrated moving
average (ARIMA), and seasonal autoregressive integrated moving average (SARIMA) is often difficult.
These types of models cannot capture non-linear relationships in complex data, such as power
consumption data with its stochastic nature. More complex models, perhaps based on machine intelligence
such as neural networks, therefore provide the necessary analytical leverage. Statistical tools from some
industrial players, like Prophet [4] from Facebook and the tool from Uber [5] that won the M4 Competition,
achieved some level of success


Journal homepage: http://ijai.iaescore.com 


Int J Artif Intell ISSN: 2252-8938 O 1005 


because of a methodology likened to the use of dropout and its variants in approximating a well-known
probabilistic model, the Gaussian process, in neural networks. In contrast to statistical modelling, neural
network models formulate a model based on features learned from existing data. This dependency makes
them data-driven and self-adaptive, which are essential properties for time series forecasting and wherever
Big Data is involved. Although neural networks are preferable in most time series problems, they are not
without limitations. The large number of trainable parameters sometimes makes neural network models
impractical to deploy on devices with low processing power. For instance, AlexNet, which won the 2012
ImageNet challenge, has about 60 million trainable parameters, and VGGNet has 138 million.
Although there has been continuous effort towards reducing the number of trainable parameters and
optimizing overall performance, more work is still needed. For instance, SqueezeNet reduced its trainable
parameters to 1.2 million while achieving reasonable performance. These model size reduction efforts are
important because real-world problems, including power consumption forecasting, require real-time,
on-device processing. An accurate prediction model is not enough if it cannot operate on a
resource-constrained, low-power edge device without latency problems. Experiments have
shown that model size affects inference time [6]: the smaller the model, the faster the
computation.

More recently, neural network methods have become very popular in time series forecasting due to
the high performance achieved. Implementations in the form of deep learning algorithms have also become a
turning point for both classification and regression tasks which, hitherto, have been difficult even on
high-performance computers. Applying a neural network solution usually requires training on large
amounts of data to obtain a machine learning model that can be used effectively for making
projections. As a result, the model obtained is normally large and requires lengthy processing time.
Therefore, a model compression technique is necessary to reduce the size and to speed up
computation. Importantly, learning the arbitrarily complex mapping from inputs to outputs has become the
focus of research, from which significant performance improvements have been achieved. However, a huge gap
still exists between the methods of deployment and the implementation environment. Closing this gap requires
a means to capture the dominant factors in the data that need to be learned, reduce the size of the
model, reduce its inference time, and select the model’s parameters. These are the major areas
that the proposed forecast model, discussed in this paper, aims to optimize.

Complex models based on deep learning, such as SGtechNet proposed in this paper, stand a better
chance of addressing most of the noted difficulties of a complex real-world problem like power forecasting.
It is intended that this model will be implemented in a low-power, low-memory on-device mobile system,
enabling smartphones to be used for demand-side energy management and control. High-speed graphics
processing units (GPUs) available in labs give good performance for models with many trainable
parameters, but such models are unusable in many real-world applications, especially
when implemented on resource-constrained devices. Achieving a lightweight model with very high
confidence in its predictions was a major objective of our work. To this end, an ensemble method together
with advanced feature representation was used in combination with other improvement methods, such as the
layer compression technique, to improve forecast results. Many methods have yielded good model
performance, but in our work we are more concerned with scalable methods capable of optimizing
model training for quick convergence. Aggregated deep belief network (DBN) outputs combined using the
support vector machine (SVM) algorithm, reported in [7], outperformed benchmark methods such as support
vector regression (SVR), feedforward neural networks (FFNN), DBN, and ensemble FFNN. The model
compression algorithm implemented in the current work addresses the challenges of cost, power, heat, and
other related issues, all of which are elaborated in the methodology discussion.


2. THE NETWORK ARCHITECTURE 

Our proposed architecture, illustrated in Figure 1, is centered on optimizing the neural network
learning process and mitigating its inherent challenges while achieving a state-of-the-art forecast model. A
weighted average ensemble method using multiple models with similar configurations but different initial
random weights is proposed. These models were trained on 3 different datasets: two load
demand datasets from a household in France and one from the smart office of SGtech, Naresuan
University, Thailand. However, combining predictions from multiple models can also add a bias that
makes the model less sensitive to specifics of the training data, the choice of training scheme, and the
serendipity of a single training run. It has been observed that ensemble methods, if not properly checked,
might not ensure that the best-performing set of weights is used in the final model. Our proposed method
therefore applies the weighted average ensemble [8], one of several ways of building a model ensemble in
neural networks alongside voting [9], stacking [10], and snapshot or checkpoint ensembles [11], in a unique
way. Here, instead of allowing all models to contribute equally to the final prediction, contributions depend
on the level of trust and estimated performance of each model, so that poorly performing models do not
degrade the overall forecast result. This method reduces not only the variance of predictions but also
the generalization error.
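
The difference between equal and trust-based contributions can be illustrated with a minimal sketch. The function name `weighted_average`, the three member forecasts, and the weights below are hypothetical values chosen for illustration; they are not taken from the SGtechNet experiments.

```python
def weighted_average(predictions, weights):
    """Combine per-model forecasts into one forecast using normalized weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]
    horizon = len(predictions[0])
    return [
        sum(w * model[t] for w, model in zip(norm, predictions))
        for t in range(horizon)
    ]


# Three hypothetical ensemble members forecasting the same two time steps.
member_forecasts = [
    [10.0, 12.0],  # trusted member
    [11.0, 13.0],  # trusted member
    [30.0, 40.0],  # poorly performing member
]

# Equal contribution (model averaging) versus trust-based contribution.
equal = weighted_average(member_forecasts, [1.0, 1.0, 1.0])
trusted = weighted_average(member_forecasts, [0.5, 0.5, 0.0])
```

With equal weights the poorly performing member pulls the combined forecast away from the other two, whereas a zero weight removes its influence entirely.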

Figure 1. Proposed neural network forecast model


Aside from model improvement, the design of SGtechNet feature learning made it adaptable to
different datasets, including the augmented power consumption dataset [12] from an automated office, in such
a way that it detected and analyzed atmospheric climate changes. In our development process we
considered the weather conditions all year round. To ensure that the real-time power consumption data used
for both augmentation and validation of the model’s performance captures this, we juxtaposed the power
generation capacity of the test environment, Thailand, with the load factors based on the urban and rural
characterization discussed in [13] to test whether climate changes have any effect on the characteristics of
household electricity consumption. Load factors, seasonal factors, and utilization factors are some of the
usage characteristics relevant to the power consumption of air conditioners, fans, refrigerators, water heaters,
and even washing machines and clothes driers. For example, in the case of the latter three
domestic appliances, heating water or drying laundry may not be necessary in a climate such as
Thailand’s, whereas it could be a significant use of power in cold climates. Table 1 presents
Thailand’s 2020 power statistics, showing the monthly power generation and load factors.
Juxtaposing the generation capacity with the load demand, as illustrated in Figure 2 and sourced from [14],
showed that the load factor surpassed generation capacity in March and September. This indicated the need
to ensure that the proposed model was validated across the different seasons of the
year. Also, the preliminary analysis of weekday power consumption and generation/load demand
discrepancy shown in Figure 3, based on data from the smart office that was used for the validation of
the proposed model, showed the daily power consumption characteristics. These daily characteristics proved
useful in determining the performance of the model.


Table 1. Thailand power statistics 2020 [14]


                  Jan.    Feb.    Mar.    Apr.    May     Jun.    Jul.    Aug.    Sep.    Oct.    Nov.    Dec.
Generation (GWh)  16,138  15,477  17,618  15,715  16,899  15,887  16,390  16,348  16,195  15,457  15,292  14,483
Load factor (%)   79.1    82.0    82.7    78.7    80.2    81.0    82.0    80.7    82.8    79.5    77.4    75.1


A description of time series modeling methods used by deep neural networks, for power 
consumption forecasting, has been introduced previously, together with discussion of the various methods 
identified in the literature. The organization of this paper includes, in section 2, related work, then the 


Int J Artif Intell, Vol. 11, No. 3, September 2022: 1004-1018 




experimental and development methodology in section 3. Section 4 presents the experimental results and
discussion, and the paper is summarized in the conclusion.

Figure 2. Annual power generation against load

Figure 3. Daily power consumption


3. METHOD 

Modeling the power consumption of a smart home is very challenging due to its stochastic nature and
non-linear relations over time. The multivariate dataset used in this model is sequence-to-sequence in
nature: given an input sequence $x = (x_1, x_2, \ldots, x_T)$ with $x_t \in \mathbb{R}^n$, where $n$ is the
variable dimension, our objective was to predict the corresponding outputs $y = (y_1, y_2, \ldots, y_T)$ at
each time step. The expected result of this type of sequential modeling network is a nonlinear mapping of
the input sequence $x$ to the prediction sequence $y$, obtained through optimization from the current state as:


$$(y_1, y_2, \ldots, y_T) = F(x_1, x_2, \ldots, x_T) \qquad (1)$$

Also, considering the neural network and its weights, each distinct forecast output is given by:

$$y_i = f\left(\sum_{i=1}^{k} w_i x_i + b_i\right) \qquad (2)$$


where $x_i$ is the input to the neuron, $w_i$ is the weight of the network, $b_i$ is the bias in the network,
$f(\cdot)$ is the nonlinear function, and $y_i$ is the output. Therefore, our objective is to develop a network
architecture capable of optimizing the mapping process. The development process started with framing the
type of prediction we are interested in, considering the available datasets, proceeded to how the network
could be trained and validated, and ended with the performance evaluation.

However, during the data preprocessing stage, we noticed non-stationarity and seasonality trends
due to the spatiotemporal factors contained in the power consumption datasets. This prompted the decision to
apply a different approach to modeling power consumption behavior for reliable forecasting. Combinations of
statistical methods and neural networks [15]-[18] have been applied to regression problems of this nature
with good results. Ordinarily, a stochastic approach would have been the easiest to apply, particularly
for power consumption forecasting, if not for its error susceptibility and inflexibility. The works in [1],
[18], [19] implemented different types of neural networks for time series problems. Since we are interested in
predicting a week-ahead horizon, we started our experimentation using the previous 7 days of power demand as
the input vector $x_i$ and the next 7 steps ahead as $y_i$ in our adaptive algorithm, and kept increasing the
number of timesteps based on the hypothesis that more timesteps give a better prediction. Additionally,
power consumption dependencies such as weather and calendar events (holidays, family social events, festival
days, and so on), as well as other factors such as geographical location, human comfort temperature,
heating/cooling technology, and type of consumer or purpose of electricity use (industrial or residential),
were included as additional lagged features to help our model learn the data better.
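
The framing of 7 previous days as input and the next 7 days as target can be sketched as a sliding window over the daily series. This is a minimal illustration under the assumption of a univariate list of daily totals; the function name `make_windows` and the placeholder series are hypothetical.

```python
def make_windows(series, n_in=7, n_out=7):
    """Split a daily series into (input, target) pairs of consecutive windows."""
    pairs = []
    last_start = len(series) - n_in - n_out
    for start in range(last_start + 1):
        x = series[start:start + n_in]                   # previous n_in days
        y = series[start + n_in:start + n_in + n_out]    # next n_out days
        pairs.append((x, y))
    return pairs


daily_load = list(range(1, 22))  # 21 placeholder daily totals
pairs = make_windows(daily_load)
```

Increasing `n_in` to 14, 21, or 28 gives the longer input windows used when varying the number of timesteps.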

This forecast method was implemented on deep learning encoder-decoder networks. Dropout, a
common technique in model regularization, was used to block out a random set of unit cells
during model training to avoid overfitting. Equation (3) expresses the way in which the proposed model accepts
multivariate time series input variables and outputs 7 distinct forecasts ahead. The input parameters are the
previously observed data at the times (t+y-1, t+y-2, ..., t). Therefore, the relationship
between the input and output data, for the purpose of predicting the future data at time (t+p), lies in the
nonlinear functional mapping from the past observations of the time series to the future value, expressed in
(3) and computed using (4).


$$y_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-p}, w) + \varepsilon_t \qquad (3)$$


where w is a vector of all parameters and f is the function determined by the network structure and the 
connection weights. 

Using a simple feed-forward neural network architecture with 3-layers, for example, the output of 
the model can be computed as: 


$$y_t = \alpha_0 + \sum_{j=1}^{q} \alpha_j \, g\left(\beta_{0j} + \sum_{i=1}^{p} \beta_{ij} \, y_{t-i}\right) + \varepsilon_t \qquad (4)$$


Here $y_{t-i}$ $(i = 1, 2, 3, \ldots, p)$ are the $p$ inputs and $y_t$ is the output; $p$ and $q$ are the
integer numbers of input nodes and hidden nodes respectively; $\alpha_j$ $(j = 0, 1, 2, \ldots, q)$ and
$\beta_{ij}$ $(i = 0, 1, 2, \ldots, p;\ j = 0, 1, 2, \ldots, q)$ are the connection weights; $\varepsilon_t$ is
the random shock; and $\alpha_0$ and $\beta_{0j}$ are the bias terms. For activation in this type of model,
nonlinear activation functions such as the logistic sigmoid, or alternatives such as linear, Gaussian, and
hyperbolic tangent functions, can be used. The connection weights can be estimated by minimizing the error
function of the network using the nonlinear least squares method of (5).


$$F(\Psi) = \sum_{t} e_t^2 = \sum_{t} (y_t - \hat{y}_t)^2 \qquad (5)$$


Equation (5) applies an optimization technique for error minimization, where $\Psi$ is the space of all
connection weights.
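
The three-layer forecast and its least squares objective can be sketched in a few lines. This is a minimal illustration that assumes a logistic activation and omits the random shock; the function names and the layout of `alpha` and `beta` are assumptions made for the example, not the authors' implementation.

```python
import math


def logistic(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forecast(lags, alpha, beta):
    """Three-layer feed-forward forecast without the random shock.

    lags  : [y_{t-1}, ..., y_{t-p}]
    alpha : [alpha_0, alpha_1, ..., alpha_q]
    beta  : one row per hidden node j, [beta_0j, beta_1j, ..., beta_pj]
    """
    output = alpha[0]  # output bias
    for a_j, b_j in zip(alpha[1:], beta):
        hidden_in = b_j[0] + sum(b * y for b, y in zip(b_j[1:], lags))
        output += a_j * logistic(hidden_in)
    return output


def sum_squared_error(series, p, alpha, beta):
    """Least squares objective: sum over t of (y_t - yhat_t)^2."""
    total = 0.0
    for t in range(p, len(series)):
        lags = [series[t - i] for i in range(1, p + 1)]
        total += (series[t] - forecast(lags, alpha, beta)) ** 2
    return total
```

An optimizer would adjust the entries of `alpha` and `beta` to make `sum_squared_error` as small as possible on the training series.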


3.1. Dataset 

Because power consumption correlates with historical load data and
consumer behavior [20], this research leveraged secondary data from [12], augmented with
remote-sensing data acquired from the SGtech Smart Office. The secondary data is a multivariate time series
dataset containing 2,075,259 measurements gathered from a house located in Sceaux, France, between
December 2006 and November 2010 (47 months), recorded in real time. The observations were made every
minute, and the temporal data captured the consumption behavior across different seasons of the year and
weather conditions. Given that SGtechNet is intended to model the power consumption behavior of a typical
smart home, where all appliances are automated, we validated the model performance with
real-time data from a smart office. Our Smart Office data were collected by configuring
devices in the automated office to transmit data in real time to a smart meter, which enabled
profiling of each individual appliance and its power consumption and also served the purpose of
power-quality monitoring. Figure 4 shows the distributions of the variables. We later added one
additional variable, Sub_metering 4, as shown in Figure 5, to the 7 independent variables that
comprised the original secondary dataset from [12]. This variable represents active energy for electric vehicle
(EV) charging and other miscellaneous energy needs that were not accounted for in the original dataset. In
the model design, additional features such as weekend and weekday indicators, as shown in Figure 3, were
added because the Total Active Power consumed differs considerably between weekdays and weekends.
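
A weekend/weekday indicator of the kind described above can be derived from the timestamp of each record. The sketch below assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS`, which may differ from the raw dataset files; the function names and the power values are hypothetical.

```python
from datetime import datetime


def is_weekend(timestamp):
    """Return 1 for Saturday or Sunday, 0 otherwise."""
    return 1 if timestamp.weekday() >= 5 else 0


def add_weekend_flag(rows):
    """rows: list of (timestamp string, active power) tuples."""
    flagged = []
    for stamp, power in rows:
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        flagged.append((stamp, power, is_weekend(ts)))
    return flagged


# Placeholder readings: 16 December 2006 was a Saturday, 18 December a Monday.
rows = [("2006-12-16 17:24:00", 1.2), ("2006-12-18 09:00:00", 0.8)]
flagged = add_weekend_flag(rows)
```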

Both datasets were split, with 75% of each used for model training and the remaining 25% used
for validation. The variables are clearly time dependent and can easily be influenced by changes in the
weather. However, the unique characteristics of the weather suggest that location is an important determinant
of the method to be applied in power forecasting. Location variation can invalidate a successful
method when it is applied in another location with different weather and ambient characteristics. Therefore, a
use case scenario [13] that characterizes power demand in urban and rural areas across Thailand was
used to determine the likely energy demand in each category and to consider their various power
consumption behaviors. In this way, the results of these predictions can be compared with other
predictions applicable to different locations and based on similar characterization.
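
A 75/25 split of a time series is commonly taken in chronological order so that validation data always follows the training data. Whether the split here was chronological is an assumption of this sketch, and `chronological_split` is a hypothetical helper name.

```python
def chronological_split(samples, train_fraction=0.75):
    """Keep the earliest samples for training and the latest for validation."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


train, validation = chronological_split(list(range(8)))
```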

Figure 4. Plot of dataset variable distributions

Figure 5. Individual distribution of the attributes


3.1.1. Data pre-processing

This public dataset was cleaned, and an imputation method was used to fill all missing and corrupted
values using a day-wise last observation carried forward (LOCF) technique. This simply means carrying
forward the observation from the same time on the previous day. For time series data of this nature, with a
seasonality trend, other methods such as linear interpolation or seasonal adjustment followed by linear
interpolation could also be applied.
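
The day-wise LOCF rule can be sketched as follows, assuming missing readings are marked as `None` and that the data are sampled once per minute (1,440 samples per day). The function name `daywise_locf` is hypothetical.

```python
def daywise_locf(values, samples_per_day=1440):
    """Fill None entries with the value from the same minute one day earlier."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None and i >= samples_per_day:
            previous_day = filled[i - samples_per_day]
            if previous_day is not None:
                # Earlier gaps are already filled, so values carry across
                # several consecutive missing days.
                filled[i] = previous_day
    return filled


# Toy example with 3 samples per "day" to keep the output readable.
raw = [1.0, 2.0, 3.0, None, 5.0, None, None, 8.0, 9.0]
clean = daywise_locf(raw, samples_per_day=3)
```

Gaps in the very first day have no earlier observation to copy and are left unfilled by this sketch.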

From Figure 4, it can be seen that voltage appears to have a Gaussian distribution, whereas the rest of
the data appear skewed (i.e., non-symmetric), necessitating a power transformation of the data before
modelling. Exploratory analysis further showed that global active power, the variable to be predicted, has
its strongest correlation with global intensity, with a correlation coefficient of 1. Therefore, this paper
further investigates the extent to which each input variable affects the prediction of the global active power.
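
The correlation referred to above can be computed with the Pearson coefficient. The sketch below is a plain implementation for illustration; the two short series are hypothetical and merely constructed to be perfectly linearly related.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Placeholder readings in which intensity is an exact multiple of power.
active_power = [1.0, 2.0, 3.0, 4.0]
intensity = [4.0, 8.0, 12.0, 16.0]
r = pearson(active_power, intensity)
```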


3.2. Model configuration 

The architecture of the network has 7 input dimensions and 1 output layer, with 3 convolutional layers
and 3 hidden layers. This architecture combines convolutional neural network (CNN) and
long short-term memory (LSTM) deep networks. The input transformations and feature representation
take place in the convolutional layers, and the resulting output is read into the fully connected LSTM
unit. Since the input data is a 1-D sequence, it is easy to interpret over the number of time steps.
The LSTM has 3 hidden layers with 4 gates that handle the update and memory functions of the network.
The gates receive both the output of the last convolutional layer obtained at the previous time step
($h_{t-1}$) and the input at the current time ($x_t$). The forget gate takes $x_t$ and $h_{t-1}$ as input to
determine the information to be retained in the cell state ($c_t$) using a sigmoid layer; $c_t$ and $c_{t-1}$
denote the cell states at timesteps $t$ and $t-1$ respectively. The value of $c_t$ is then determined by the
input gate $i_t$ using $x_t$ and $h_{t-1}$. The output gate regulates the output of the LSTM cell based on
$c_t$ using both a sigmoid layer and a tanh layer.
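
The gate interactions described above follow the standard LSTM update, which can be sketched for a single scalar unit. The parameter layout (`params` mapping each gate name to an input weight, a recurrent weight, and a bias) is an assumption made for readability and is not the SGtechNet implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w, u, b = params[name]
        return w * x_t + u * h_prev + b

    f_t = sigmoid(pre_activation("forget"))        # what to keep from c_{t-1}
    i_t = sigmoid(pre_activation("input"))         # how much new content to add
    g_t = math.tanh(pre_activation("candidate"))   # candidate cell content
    o_t = sigmoid(pre_activation("output"))        # how much of the cell to expose
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "candidate", "output")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

With all parameters at zero every sigmoid gate evaluates to 0.5, so the example simply halves the previous cell state.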


3.2.1. Network training 

The network is trained to forecast the next 7 consecutive days, i.e., a week of time steps ahead, using
the learned features. The additional features introduced during model design to augment
the data are concatenated to the feature vector and passed to the final prediction. Because an ensemble method
was used to ensure better generalization, global optimization was performed on the ensembled models
to find the best coefficients for the weighted ensemble. The result of this optimization determines the
weight of each ensemble member’s contribution to the final prediction.


3.2.2. Prediction/evaluation 

Since various factors, including atmospheric climate factors, are among the determinants of the
power consumption differences experienced across different locations, SGtechNet analyzed those factors.
Diverse atmospheric climate differences across locations prompted the need to validate the performance of
this model using multivariate datasets collected from different locations, specifically France and Thailand.
The aim was to determine the effect and influence of climatic factors on performance, for proper
comparison with other forecasting methods. Therefore, a series of experiments was conducted at different
timesteps with the same model configuration to increase confidence in the predictions and their validity for
future studies. Two sets of predictions were considered: one using different numbers of ensemble members in
increasing levels of complexity across different timesteps, and one using the model averaging method. These
were chosen to ascertain the effectiveness of the weighted average ensemble method against the poor
performance that results from allowing equal contributions from ensemble members to the final prediction
model, especially when some of the models are bad, and to mitigate the drawback of the lengthy
preference-ordering calculation over individual ensemble members, which often results in higher
computational complexity in some ensemble techniques like voting [21]. We started with 10 ensemble members,
whose contributions to the final prediction model are based on their confidence level, and kept varying the
number until we reached a standalone model. There were no discrepancies in error when the number of ensemble
members was varied. However, a significant discrepancy was observed with the model averaging method, where
equal contribution from ensemble members was allowed. The prediction performance of the proposed model
is computed based on root mean square error (RMSE) and compared against mean absolute percentage error
(MAPE) and mean absolute error (MAE) (see Figure 6) for the averaging ensemble method and the standalone
method. These metrics are the most used performance measures for time series analysis because the error is
in the same unit as the predictions and can range from 0 to ∞. Figure 7 shows the validation
loss across different timesteps (7, 14, 21 and 28). A walk-forward validation scheme was implemented, in which
the model made a 1-week prediction and then used the actual data for that week, or for 2 weeks, as the basis
for predicting the subsequent week.
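
The three error metrics and the walk-forward scheme can be sketched as below. The forecaster `repeat_last_week` is a hypothetical placeholder that stands in for the trained model, and the function names are chosen for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def walk_forward(train_weeks, test_weeks, forecast_fn):
    """Forecast each test week from the history, then reveal its actual values."""
    history = list(train_weeks)
    scores = []
    for actual_week in test_weeks:
        predicted_week = forecast_fn(history)
        scores.append(rmse(actual_week, predicted_week))
        history.append(actual_week)  # actual data becomes available for the next step
    return scores


def repeat_last_week(history):
    """Placeholder forecaster: predict that next week repeats the last one."""
    return history[-1]


scores = walk_forward([[1.0, 1.0]], [[1.0, 1.0], [3.0, 3.0]], repeat_last_week)
```

Replacing `repeat_last_week` with a call to a trained model gives one RMSE score per forecast week.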

Figure 7. Validation loss across different timesteps


3.3. Encoder-decoder network

We considered advanced feature representation methods, such as the encoder-decoder, to preserve the
hidden abstractions and invariant structures in the time series input. These have been previously applied in
reinforcement [22], supervised [23], and unsupervised learning. This unsupervised neural network
method is designed for the adaptive learning of the long-term dependencies and hidden correlation features of
multivariate spatiotemporal data, and is trained to reconstruct its own input in each layer as its output,
which is then used as the input of the successive layer. In this paper, an encoder that extracts useful
representative features from the time series input data was trained in such a way that the decoder could
conveniently reconstruct those features from the encoded space. Specifically, the outputs of the convolutional
layers are concatenated by Conv2D and followed by LSTM layers, as in [5], [24], to capture the inherent
spatiotemporal correlations in the time series input data. The proposed ConvLSTM encoder-decoder
architecture has 2 sub-models: the first reads the input sequence and encodes it (i.e., maps the
variable-length source sequence) into a fixed-length vector, while the second decodes the
fixed-length vector and outputs the predicted sequence (i.e., maps the vector representation back to a
variable-length target sequence). The output of the decoder represents the learned feature. Thereafter, a dense
layer is used as the output of the network; the same weights are applied at every time step by wrapping the
dense layer in a time-distributed wrapper.
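
The effect of the time-distributed wrapper, namely one dense layer whose weights are shared across all output time steps, can be illustrated without any deep learning library. The decoder outputs, weights, and bias below are hypothetical numbers, and the functions are a plain-Python stand-in rather than the framework layer itself.

```python
def dense(vector, weights, bias):
    """A single-output dense layer: weighted sum plus bias."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer, with shared weights, at every time step."""
    return [dense(step, weights, bias) for step in sequence]


# Hypothetical decoder output: 3 time steps, each with 2 features.
decoder_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
forecast = time_distributed_dense(decoder_outputs, weights=[0.5, 0.5], bias=1.0)
```

Because the weights are shared, the number of output-layer parameters does not grow with the length of the forecast horizon.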


3.4. Model compression 

On-device systems are resource-constrained, with limited memory and low computing power.
Deep learning algorithms, however, are computationally and memory intensive, so they cannot be implemented
in real-world applications or on other resource-constrained systems without difficulty. As deep learning
models grow deeper, their inference time increases along with the number of trainable
parameters, making them difficult to deploy on resource-constrained devices. By the parsimony principle,
models with a smaller number of parameters are more likely to provide an adequate representation of the
underlying time series data, whereas models with a high number of trainable parameters require more energy
and space and are likely to overfit during training. Consequently, a compression technique, as presented in
[25], is required to allow the deployment of a large model on resource-constrained devices. Table 2 summarizes
results from the literature on the most recent efforts towards model size and trainable parameter reduction
using different techniques, in comparison with SGtechNet. This comparative analysis shows that
SGtechNet has the fewest trainable parameters with a modest model size, which justifies
its suitability for low-power, low-memory devices. Model size is very important for the
performance optimization of on-device systems because larger models mean more memory
references and more energy [26].


Table 2. Model parameter comparison


Model                  Parameters  Size     Training time  Inference time
ENet [27]              0.37 M      0.7 MB   15 mins        383 ms
LEDNet [28]            1.856 M     3.8 MB   -              -
SegNet [29]            29.46 M     56.2 MB  37 mins        286 ms
AlexNet [30], [31]     60 M        232 MB   7,920 mins     -
VGG16 [31], [32]       138 M       528 MB   -              -
SqueezeNet [25]        0.66 M      4.8 MB   -              -
ResNet152 [31]         232 M       60 MB    -              -
GoogleNet [31]         6.8 M       28 MB    -              -
SGtechNet (Proposed)   128 K       4.93 MB  1.3 mins       3 ms


Therefore, to fit the SGtechNet model on devices with limited resources and make the model usable
in real-world applications, the SqueezeNet [25] concept was used, with the modification that 1x1 and
1x3 convolution filters were used for feature representation. As each kernel receives an input time series, the
corresponding outputs are concatenated and followed by convolutional-LSTM layers, which capture the
long-term spatial patterns in the electricity consumption data. This method reduces not only the
dimensionality of the input data but also its complexity [33], leading to an improved result even though a
marginal cost is incurred due to a slight increase in the number of parameters. However, the choice of a
smaller filter reduces the model’s inference time. SqueezeNet achieves almost the same accuracy as AlexNet
despite its compression of trainable parameters, although that accuracy is a little lower than GoogleNet’s.
SeNet [34] developed an architecture that recalibrates channel-wise feature responses and uses them to
determine the interdependencies existing between two channels. Channel-wise scale and element-wise summation
operations were combined into a single layer, “AXPY”, using skip-connections. This resulted in considerable
reductions in memory, cost, and computational burden. It is important to note that the application
domain of most of the state-of-the-art models in Table 2 is image classification and detection, so for
SGtechNet to achieve an RMSE of 358 kWh in a regression task like power forecasting shows a high level
of robustness. Even though the training and inference times of some of the models compared with SGtechNet
were not reported in the literature, the few that were reported clearly put SGtechNet at an advantage in
terms of computational complexity.
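
The parameter saving from squeezing channels before applying 1x1 and 1x3 filters can be checked with simple arithmetic. The sketch below counts weights and biases for 1-D convolutions; the channel sizes (64, 16, 32) are hypothetical and are not the SGtechNet configuration, and the size estimate covers float32 weights only, without file format overhead.

```python
def conv1d_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one 1-D convolutional layer."""
    weights = kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def float32_megabytes(n_params):
    """Storage needed for the raw parameters at 4 bytes each."""
    return n_params * 4 / (1024 ** 2)


def squeeze_expand_params(in_channels, squeeze, expand_1, expand_3):
    """1x1 squeeze layer followed by parallel 1x1 and 1x3 expand filters."""
    return (conv1d_params(1, in_channels, squeeze)
            + conv1d_params(1, squeeze, expand_1)
            + conv1d_params(3, squeeze, expand_3))


# Both variants map 64 input channels to 64 output channels.
plain = conv1d_params(3, 64, 64)
squeezed = squeeze_expand_params(64, 16, 32, 32)
```

In this hypothetical configuration the squeezed block needs roughly a quarter of the parameters of the plain 1x3 layer.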


3.5. Feature representation 

Feature learning, or representation learning, in machine learning is a set of techniques that allows a
system to automatically discover the representations needed for feature detection or classification from raw
data. Figure 8 shows the feature learning process. This is a method of finding a representation of the data:
the features, the distance function, and the similarity function dictate how the predictive model will perform.
Feature representation helps to reduce data complexity, so that anomalies and noise can be reduced. It also
helps to reduce the dimensionality of the input data, making it easier to find patterns and anomalies, and
provides a better understanding of the behavior of the data in general. Because our time series input data is
1D, smaller kernel filters (1, 3) were used in the convolutional layers for feature learning.
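
The behavior of kernel sizes 1 and 3 on a 1-D series can be sketched with a plain sliding dot product. The load values and kernel coefficients below are hypothetical; in a trained network the coefficients are learned.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, as computed by convolutional layers."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


load = [1.0, 2.0, 4.0, 7.0, 11.0]
pointwise = conv1d(load, [2.0])                # kernel size 1: rescales each step
local_trend = conv1d(load, [-1.0, 0.0, 1.0])   # kernel size 3: change across neighbours
```

A size-1 kernel only reweights each time step, while a size-3 kernel summarizes a short local neighbourhood.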

Considering the spatiotemporal nature of power consumption variables, a state space representation 
of (6) represents the transition process expressing the discrete stochastic behavior of the variables and (7) 
represents the likelihood of the observations with the assumption that states are part of the model parameters. 


$$\theta_{t+1} = \theta_t + u_t \qquad (6)$$
$$y_t = f(x_t, \theta_t) + v_t \qquad (7)$$


where $u_t$ is the process noise and $v_t$ is the measurement noise.

Figure 8. Feature learning process


3.6. Ensemble method 

The stochastic nature of power consumption, which varies with season and time, necessitated the use
of a stochastic learning algorithm for training on the dataset. However, neural network training has the
inherent limitation of randomness, which results in a different final model each time it is trained on the
same dataset. To address this limitation, the ensemble method of [7], with a weighted average of different
trained models, is used for prediction. Ordinarily, a model ensemble allows each model an equal contribution
to the final prediction, which can be a limitation when the contribution of poorly performing
models undermines that of a well-performing model. In the proposed model, however, the contribution to the
final model depends purely on each model’s trust and estimated performance,
resulting in an improved overall prediction.

Sensitivity analysis was carried out to determine the number of ensemble members most appropriate 
for the forecasting problem and their impact on test accuracy. To determine the 
trustworthiness of the ensemble members and to estimate their performance, we need to find their weights. 
However, because there is no analytical solution for the weight values, we used gradient descent 
optimization with a unit norm weight constraint on the holdout validation set rather than on the training set. 
A simpler way of finding each ensemble member's weight would have been to grid search values, 
but because our holdout validation set is large enough, gradient descent optimization is the preferred option. 

This optimization procedure constrains the model weights to sum to 1, i.e., $W_1 + W_2 + \cdots + W_k = 1$, 
and to be non-negative, so that each weight indicates the percentage of trust in, or the expected 
performance of, each model. The optimization process uses the information provided to it to search 
for weights with lower errors within defined bounds (i.e., 0.0-1.0) among 10 ensemble members until 
convergence. Before performing weight optimization, 10 single models were created, and their 
individual performances were evaluated on the test dataset. For the optimization, a differential evolution 
function was used to search for the optimal set of weights over several iterations; it returned the 
score to be minimized and the best weights, whose performance is reported on the holdout 
validation data. The optimal weights of the base learners are aggregated to find the best tradeoff between 
bias and variance and to minimize the prediction error. Given each base learner's prediction ($\hat{y}_j$) 
on the holdout validation set, the optimization problem is therefore: 


$$\min \; \mathrm{Error}\left(W_1\hat{y}_1 + W_2\hat{y}_2 + \cdots + W_k\hat{y}_k,\; y\right) \qquad (8)$$


such that $\sum_{j=1}^{k} W_j = 1$ and $W_j \ge 0 \;\; \forall j = 1, \ldots, k$, where $W_j$ represents the 
weight corresponding to base model $j$ ($j = 1, \ldots, k$), $\hat{y}_j$ is the vector of predictions of base 
model $j$, and $y$ is the vector of true values. So, at any instance of training base learner $j$, the weight 
$W_j$ is computed from the optimization in (9), where $n$ is the total number of instances, $y_i$ is the true 
value of observation $i$, and $\hat{y}_{ij}$ is the prediction of observation $i$ by base model $j$. 


$$\min \; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{k} W_j\hat{y}_{ij}\right)^2 \qquad (9)$$


such that $\sum_{j=1}^{k} W_j = 1$ and $W_j \ge 0 \;\; \forall j = 1, \ldots, k$. 

The ensemble member contributions are evaluated based on those chosen weights. This process not 
only improves model performance but also saves time. Otherwise, the search for weights with lower 
error values would need to be done randomly and exhaustively, which is time-consuming. 
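
The constrained weight search described in this section can be sketched as projected gradient descent on the mean squared error, with the weights kept non-negative and summing to 1. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the learning rate `lr`, the step count, and the two hypothetical base models are all invented for illustration, and the sketch does not use the differential search routine mentioned in the text.

```python
def blend(weights, predictions, i):
    """Weighted sum of all base-model predictions for observation i."""
    return sum(w * p[i] for w, p in zip(weights, predictions))


def mse(weights, predictions, targets):
    """Mean squared error of the weighted ensemble on the validation targets."""
    n = len(targets)
    return sum((targets[i] - blend(weights, predictions, i)) ** 2 for i in range(n)) / n


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum_j w_j = 1}."""
    theta, running = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        running += u
        candidate = (running - 1.0) / i
        if u - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_ensemble_weights(predictions, targets, lr=0.01, steps=500):
    """Search for ensemble weights, starting from equal weights (model averaging)."""
    k, n = len(predictions), len(targets)
    weights = [1.0 / k] * k
    best, best_err = weights, mse(weights, predictions, targets)
    for _ in range(steps):
        residuals = [targets[i] - blend(weights, predictions, i) for i in range(n)]
        # Gradient of the mean squared error with respect to each weight.
        grads = [-2.0 / n * sum(r * p[i] for i, r in enumerate(residuals))
                 for p in predictions]
        weights = project_to_simplex([w - lr * g for w, g in zip(weights, grads)])
        err = mse(weights, predictions, targets)
        if err < best_err:  # keep the best feasible weights seen so far
            best, best_err = weights, err
    return best, best_err


# Hypothetical holdout targets and predictions from two base models.
targets = [10.0, 12.0, 14.0, 16.0, 18.0]
predictions = [
    [10.5, 11.5, 14.5, 15.5, 18.5],  # accurate hypothetical base model
    [14.0, 8.0, 18.0, 12.0, 22.0],   # inaccurate hypothetical base model
]
weights, error = fit_ensemble_weights(predictions, targets)
equal_error = mse([0.5, 0.5], predictions, targets)
```

Because the search starts from equal weights and keeps the best weights seen, the returned error can never exceed that of plain model averaging on the same holdout data. A suitable learning rate depends on the scale of the predictions.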


3.6.1. Comparing weighted ensemble and model averaging method performance 

Table 3 shows the results produced by the weighted average ensemble method, which demonstrate 
that this method outperformed the model averaging method for individual ensemble members, even though 
the difference in their processing times is insignificant. Furthermore, the model's performance is compared 
with the baseline model (see Table 4) using both secondary and primary datasets acquired from two different 
continents. This comparative analysis completes the study by addressing the major limitation of the 
ensemble technique, namely the misleading assumption that all ensemble 
members are equally effective. 


Table 3. Comparative analysis of weighted ensemble models and model averaging method 

Statistics             Weighted ensemble models    Model averaging method 
Number of iterations   1,000                       1,000 
Validation time        2.053 s                     2.185 s 
Average RMSE           358 kWh                     362.617 kWh 


Table 4. Comparative analysis of weighted ensemble models and baseline model on different datasets 

                   Training on HHPC dataset, France    Training on real-time dataset, SGtech, Thailand 
Model statistics   Proposed model   Baseline model     Proposed model   Persistence model 
                                                                        Hourly        Daily         Weekly 
Training time      114.109 s        -                  78.916 s         -             -             - 
Prediction time    2.282 s          -                  2.053 s          -             -             - 
RMSE               361.885 kWh      465.294 kWh        358 kWh          480.246 kWh   469.389 kWh   465.294 kWh 
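
For readers unfamiliar with persistence baselines, the sketch below shows the usual naive form: each value is predicted to equal the observation a fixed number of steps earlier, and the RMSE is computed over the overlapping part of the series. This excerpt does not spell out how the hourly, daily, and weekly variants in Table 4 were built, so the `lag` parameter, the function names, and the toy series are assumptions made for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def persistence_rmse(series, lag):
    """RMSE of the naive forecast that repeats the value observed `lag` steps earlier."""
    actual = series[lag:]       # values being forecast
    predicted = series[:-lag]   # the same series shifted back by `lag` steps
    return rmse(actual, predicted)


# Hypothetical daily totals with a perfect 7-day cycle, repeated for three weeks.
daily_totals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] * 3
weekly_persistence = persistence_rmse(daily_totals, lag=7)
daily_persistence = persistence_rmse(daily_totals, lag=1)
```

On this toy series the 7-step lag reproduces the weekly cycle exactly, while the 1-step lag does not, which mirrors why the choice of lag matters for a persistence baseline.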


4. RESULTS AND DISCUSSION 

The experiments were run on a Google Colab TPU using the Keras® neural network API with its 
TensorFlow backend, and produced the results shown in Table 3 and Table 4. Based 
on the performance evaluation, the model significantly outperformed the baseline model. 
Although an unstable training trajectory was observed during training, which may indicate overfitting 
to the training data, the overall performance is good. In the model's evaluation result in Figure 9, RMSE was 
found to differ statistically across the 7 days of the week, as shown in Figure 9(a), while Figure 9(b) shows 
how the training error decreased sharply after training began before it became linear, due to the 
model's complexity; the validation error behaved likewise. A squeeze layer technique, adopted from [35], 
reduced the size of the model to 4.9 MB without affecting its performance, making it implementable on 
low-power, low-memory devices such as smartphones, iPads, and tablets. One limitation of the model 
performance enhancement method discussed in this paper is that, as the model size is reduced, the number of 
parameters increases slightly, resulting in a marginal increase in resource usage at implementation. 
Therefore, further work is proposed to develop a systematic method of reducing the model size without 
necessarily increasing the number of model parameters. 
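
A per-day error breakdown of the kind summarized in Figure 9(a) can be computed by grouping the errors of a multi-step forecast by horizon. The sketch below is an assumption about how such a breakdown might be produced, not the evaluation code behind the reported figures; the function name and the toy actual and forecast values are hypothetical.

```python
import math


def rmse_per_day(actual_weeks, forecast_weeks):
    """Return one RMSE per forecast day, aggregated over all forecast weeks."""
    days = len(actual_weeks[0])
    scores = []
    for d in range(days):
        squared = [(a[d] - f[d]) ** 2 for a, f in zip(actual_weeks, forecast_weeks)]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


# Hypothetical data: two 7-day windows whose forecast error grows by one unit per day.
actual_weeks = [[10.0] * 7, [20.0] * 7]
forecast_weeks = [
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
    [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0],
]
scores = rmse_per_day(actual_weeks, forecast_weeks)
```

Reporting one score per forecast day, rather than a single aggregate, makes it visible when accuracy degrades for particular days of the 7-day horizon.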


4.1. Comparative analysis 

This model's performance was evaluated against the baseline model and other alternative 
forecasting methods, even though some metrics, such as computational speed and prediction time, were not 
reported in all the literature reviewed. We also analyzed the power consumption datasets used in validating 
SGtechNet along daily consumption cycles, by time and by day, as shown in Figure 10(a) and 10(b) respectively, 
for a clearer understanding of residents' habits. An experimental framework for the empirical comparison of 
different model performances, based on varying test conditions, was introduced. The uniqueness of weather 
characteristics in different locations means there is no guarantee that a forecasting method that is 
successful at one location will be effective at another. The inclusion of this framework in the 
design accounts for diverse climatic conditions and creates a valuable environment for future studies in 
emerging forecasting technologies. It increases confidence in the observed results by allowing the 
validity of the forecasting algorithm to be tested on both the test set from France and the test set from 
SGtech, Naresuan University, Thailand, both of which are real-time data. In addition, to show that the 
improved processing time and other improvements achieved in this model are due to the scientific 
contributions rather than software and hardware differences, we experimented on different technologies. We 
compared the results obtained with an NVIDIA GeForce GTX 1080 Ti GPU/TPU-enabled TensorFlow against those 
obtained with an NVIDIA Tesla K80 GPU running on Ubuntu Server 16.04.3. The discrepancy in the results was 
found to be insignificant. The result of this model is further compared with model averaging 
and standalone methods, as shown in Table 5. The SGtechNet model size is 4.93 MB, which means it can easily 
be placed in an on-chip static random access memory (SRAM) cache. 
Figure 9. SGtechNet performance evaluation result showing (a) RMSE across the 7 consecutive days 
forecasted and (b) the model's training and validation loss 


Table 5. Comparison of the experimental results of proposed model against some existing power forecast methods 

Model statistics   Proposed model   Persistence model   Model A [36]   Model B [37] 
Training time      114.109 s        -                   -              - 
Prediction time    2.282 s          -                   -              - 
Size               4.935 MB         -                   -              - 
RMSE               358 kWh          465.294 kWh         530 kWh        450.5 kWh 
Figure 10. Power consumption across different (a) times of day and (b) weekdays 


5. CONCLUSION 

The nationwide lockdown due to the COVID-19 pandemic is causing a rise in domestic power consumption, 
making energy conservation and planning more relevant than ever. In our research, we demonstrated the 
effectiveness of combining atmospheric climate domain knowledge of the factors that determine location-based 
differences in power consumption with empirical data captured from automated systems for future 
energy forecasting. The forecast model, SGtechNet, developed to optimize the data learning and prediction 
process, leveraged a multivariate dataset to make a multi-step time series forecast 7 days ahead. 
SGtechNet is based on a ConvLSTM-Encoder-Decoder algorithm explicitly designed to optimize the quality 
of spatiotemporal encodings throughout the feature extraction process. The validation report of this model 
showed a significant improvement in the forecast result when a real-time dataset from an automated office 
was used for model validation, compared against a manually operated home/office represented by 
the secondary data. This implies that, aside from the social behavioral factor that drives users' choice of 
time of use (ToU) of electricity, environmental and real-time control factors also contribute to 
the consumption rate and therefore the cost of power consumed domestically or in an office 
workplace. The RMSE of 361 kWh recorded was compared with 465 kWh for the persistence model, and an 
improved RMSE of 358 kWh was achieved when validated on holdout validation data from the automated 
office. Overall performance on error rate, forecast time, and inference time was then compared with 
published research, and the comparison showed that our model, SGtechNet, provided significant 
improvements in these factors. One of the most significant achievements of SGtechNet is its adaptiveness to 
other forecast problems and different datasets, in such a way that it detected and analyzed the atmospheric 
climate changes over different locations. 